ServOMap results for OAEI 2013
Authors
Abstract
We briefly present in this paper ServOMap, a large scale ontology matching system, and the performance it achieved during the OAEI 2013 campaign. This is its second participation in the OAEI campaign.

1 Presentation of the system

ServOMap [1] is a large scale ontology matching system designed on top of the ServO Ontology Server system [2], an idea originally developed in [3]. It is able to handle ontologies containing several hundreds of thousands of entities. To deal with large ontologies, ServOMap relies on an indexing strategy for reducing the search space and computes an initial set of candidates based on the terminological description of the entities of the input ontologies. New components have been introduced since the 2012 version of the system. Among them:

– the use of a set of string distance metrics to complement the vector-based similarity of the IR library we use1,
– an improved contextual similarity computation thanks to the introduction of a Machine Learning strategy,
– the introduction of a general purpose background knowledge source, WordNet [4], to deal with synonymy issues within entities' annotations,
– the use of a logical consistency check component.

In 2013, ServOMap participated in the entity matching tracks and did not implement a specific adaptation for the Interactive Matching and MultiFarm tracks.

1.1 State, purpose, general statement

ServOMap is designed with the purpose of facilitating interoperability between applications based on heterogeneous knowledge organization systems (KOS). The heterogeneity of these KOS may have several causes, including their language format and their level of formalism. Our system relies on Information Retrieval (IR) techniques and a dynamic description of the entities of different KOS for computing the similarity between them. It is mainly designed to meet the need of matching large scale ontologies, and it proved efficient at tackling this issue during the 2012 OAEI campaign.
1 http://lucene.apache.org/

1.2 Specific techniques used

ServOMap has a set of highly configurable components. The overall workflow is depicted in Figure 1. It includes three steps, briefly described in the following. Typically, the input of the process is two ontologies, which can be described in OWL, RDF(S), SKOS or OBO. ServOMap provides a set of weighted correspondences [5] between the entities of these input ontologies.

Fig. 1. ServOMap matching process.

Initialization Step. During the initialization step, the Ontology Loading component is in charge of processing the input ontologies. For each entity (concept, property, individual), a virtual document is generated from its set of annotations for indexing purposes. These annotations include the ID, labels, comments and, if the entity is a concept, information about its properties. For an individual, the values of domain and range are considered as well.

Metadata Generation. A set of metrics is computed. They include the size of the input ontologies in terms of concepts, properties and individuals, the list of languages in which the entities' annotations (labels, comments) are expressed, etc. Determining the size helps to adapt the matching strategy later. Indeed, besides detecting an instance matching case, we distinguish this year between small (fewer than 500 concepts) and large ontologies. Detecting the set of languages allows using the appropriate list of stopwords later.

Ontology Indexing. With ServOMap, we consider an ontology as a corpus of semantic documents to process. Therefore, the purpose of the indexing module is to build an inverted index for each input ontology from the virtual documents generated previously. The content of each virtual document is passed through a set of filters: stopword removal, non-alphanumeric character removal, lowercasing and stemming of labels, and conversion of numbers to characters. In addition, labels denoting concepts are enriched by their permutations.
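The filter chain and the permutation enrichment can be sketched as follows (a minimal Python illustration; the stopword list is a toy one, the stemming and number-to-character steps are omitted, and the handling of labels longer than 4 words is an assumption, since the paper does not detail it):

```python
from itertools import permutations

# Tiny illustrative stopword list; ServOMap selects the list per detected language.
STOPWORDS = {"a", "an", "and", "of", "the"}

def normalize(label: str) -> list[str]:
    """Apply the filter chain: lowercase, strip non-alphanumeric chars, drop stopwords."""
    tokens = ("".join(ch for ch in tok if ch.isalnum()) for tok in label.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

def enrich_label(label: str, max_words: int = 4) -> set[str]:
    """Permutation enrichment, applied to the first 4 words of a label."""
    tokens = normalize(label)
    head, tail = tokens[:max_words], tokens[max_words:]
    return {" ".join(list(p) + tail) for p in permutations(head)}
```

For 'Bone Marrow Donation' this produces the orderings of the three normalized words; stemming, omitted here, would further conflate morphological variants.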
Permutation enrichment is applied to the first 4 words of each label. For instance, after enriching the term 'Bone Marrow Donation' we obtain the set {Bone Marrow Donation, Marrow Bone Donation, Marrow Donation Bone, Donation Marrow Bone, Donation Bone Marrow}. Further, two strategies are used for indexing: exact and relaxed indexing. Exact indexing allows highly precise retrieval; in this case, before the indexing process, all words of each label are concatenated by removing the spaces between them. In addition, for optimization purposes, it is possible to index each entity with information about its siblings, descendants and ancestors.

Candidates Retrieving. The objective is to compute a set of candidate mappings M = Mexact ∪ Mrelaxed ∪ Mcontext ∪ Mprop.

Lexical Similarity Computing. Let us assume that after the initialization step we have two indexes I1 and I2 corresponding respectively to the input ontologies O1 and O2. The first step of candidate retrieval is to compute the initial set of candidate mappings, constituted only of couples of concepts and denoted by Mexact. This set is obtained by performing an exact search over I1 using O2 as the search component and, respectively, over I2 using O1. To do so, a query taking the form of a virtual document is generated for each concept and sent to the target index. The search is performed through the IR library, which uses the usual tf.idf score. We select the best K results having a score greater than a given threshold θ. The obtained couples are then filtered in order to keep only those satisfying the lexical similarity condition. This condition is checked as follows. For each filtered couple (c1, c2), two lexical descriptions are generated, constituted respectively by the ID and labels of c1 and its direct ancestors (Γ1), and the ID and labels of c2 and its direct ancestors (Γ2).
We compute a similarity Simlex = f(α × ISub(Γ1, Γ2), β × QGram(Γ1, Γ2), γ × Lev(Γ1, Γ2)), where ISub, QGram and Lev denote respectively the ISUB similarity measure [6], the QGram distance and the Levenshtein distance. The coefficients α, β and γ were chosen empirically for OAEI 2013. All couples with Simlex greater than a threshold are selected. Finally, Mexact is the intersection of the two sets of selected couples obtained after the searches performed on the two indexes. The same process is repeated in order to compute the set Mrelaxed from the concepts not yet selected by the exact search. A strategy similar to the computation of Mexact is used for computing the similarity between the properties of the input ontologies. This generates the Mprop set. Here, the description of a property includes its domain and range.

Extended Similarity Computing. In order to deal with the synonymy issue, from the set of concepts not selected after the previous phase, we use the WordNet dictionary to retrieve alternative labels for the concepts to be mapped. The idea is to check whether a concept in the first ontology is denoted by synonymous terms in the second one. All couples found in this way are retrieved as possible candidates.

Contextual Similarity Computing. The idea is to acquire new candidate mappings, Mcontext, among those couples which have not been selected in the previous steps. To do so, we rely on the structure of the ontology, considering that the similarity of two entities depends on the similarity of the entities that surround them. In 2013, we introduced a Machine Learning strategy which uses Mexact as the basis of a training set, using the WEKA tool [7]. Indeed, according to our tests, candidate mappings from Mexact tend to be highly accurate. Therefore, retrieving candidates using contextual similarity is cast as a classification problem: each new couple is to be classified as correct or incorrect according to the candidates already in Mexact.
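The string measures underlying Simlex can be illustrated with a short sketch. Levenshtein and a QGram similarity are implemented directly; the ISUB measure is omitted, and since the paper does not specify the combining function f, it is taken here — as an assumption — to be the maximum of the weighted scores:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lev_sim(a: str, b: str) -> float:
    """Levenshtein distance normalized into a [0, 1] similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def qgram_sim(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over character q-grams (a common QGram similarity)."""
    grams = lambda s: Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ga, gb = grams(a), grams(b)
    total = sum(ga.values()) + sum(gb.values())
    return 2 * sum((ga & gb).values()) / total if total else 1.0

def sim_lex(d1: str, d2: str, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical combiner f: maximum of the weighted scores (ISUB omitted)."""
    return max(alpha * lev_sim(d1, d2), beta * qgram_sim(d1, d2))
```

Taking the maximum makes the condition lenient: a couple passes if any one measure rates it highly, which matches the use of several complementary metrics; the real f and its coefficients were tuned empirically.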
We use 5 similarity measures (Levenshtein, Monge-Elkan, QGram, Jaccard and BlockDistance) to compute the features of the training set. For each couple (c1, c2) ∈ Mexact, we compute the 5 scores using the IDs and labels associated with c1 and c2 and label this entry as correct. We complete Mexact by randomly generating new couples assumed to be incorrect. To do so, for each couple (c1, c2) in Mexact, we compute the 5 scores for (c1, ancestor(c2)), (ancestor(c1), c2), (descendant(c1), c2) and (c1, descendant(c2)) and label them as incorrect. The ancestor and descendant functions retrieve the super-concepts and sub-concepts of a given concept. We use the J48 decision tree algorithm of Weka to generate the classifier.

Fig. 2. Strategy for contextual based candidates generation. For each couple of Mexact, the similarity of the surrounding concepts is looked up.

We build the dataset to classify as follows. The exact set is used to learn new candidate couples according to the strategy depicted in Figure 2, assuming here for instance that (a6, b6) ∈ Mexact. For each couple of Mexact, the idea is to retrieve possible couples not already in Mexact among the sub-concepts ((a7, b7), (a7, b8), (a8, b8), (a8, b7) in Figure 2), the super-concepts and the siblings. For each candidate couple (c1, c2), if the score s = f(getScoreDesc(), getScoreAsc(), getScoreSib()) is greater than a fixed threshold, then we compute the 5 similarity scores for (c1, c2). The functions getScoreDesc(), getScoreAsc() and getScoreSib() compute respectively a score for (c1, c2) from its descendant, ancestor and sibling concepts. The obtained dataset is classified using the previously built classifier.

Post-Processing Step. This step involves enriching the set of candidate mappings (mainly by incorporating couples having all their sub-concepts mapped), selecting the final candidates from the set M, and performing inconsistency checks.
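The construction of the training set described above (one correct row per couple of Mexact, four generated incorrect rows per couple via ancestor/descendant substitution) can be sketched as follows. A single difflib ratio stands in for the five measures and the ancestor/descendant functions are reduced to lookup tables, so this only illustrates the data layout fed to Weka's J48:

```python
from difflib import SequenceMatcher

def features(c1: str, c2: str) -> list[float]:
    """Stand-in feature vector: ServOMap computes 5 string measures
    (Levenshtein, Monge-Elkan, QGram, Jaccard, BlockDistance);
    a single difflib ratio is used here to keep the sketch self-contained."""
    return [SequenceMatcher(None, c1, c2).ratio()]

def build_training_set(m_exact, ancestor, descendant):
    """One 'correct' row per couple in M_exact, plus four generated
    'incorrect' rows obtained by swapping in an ancestor or a descendant."""
    rows = []
    for c1, c2 in m_exact:
        rows.append((features(c1, c2), "correct"))
        for n1, n2 in ((c1, ancestor[c2]), (ancestor[c1], c2),
                       (descendant[c1], c2), (c1, descendant[c2])):
            rows.append((features(n1, n2), "incorrect"))
    return rows
```

Each couple of Mexact thus contributes five labeled rows; the resulting table is what the decision tree is trained on.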
We have implemented a new filtering algorithm for selecting the best candidates based on their scores, and we perform the consistency checks already implemented in the 2012 version (disjoint concepts, criss-cross). Further, we use the repair facility of the LogMap system [8] to perform a logical inconsistency check. Finally, we have implemented an evaluator computing the usual Precision/Recall/F-measure for the generated final mappings when a reference alignment is provided.

1.3 Adaptations made for the evaluation

ServOMap is configured to adapt its strategy to the size of the input ontologies. Therefore, as mentioned earlier, two categories are considered: input ontologies with fewer than 500 concepts and ontologies with more than 500 concepts. For large ontologies, our tests showed that exact search is sufficient for generating the concept mappings of the OAEI test cases, while for small ones relaxed and extended search is needed. Further, given the performance achieved by our system in OAEI 2012 [9], the focus this year was more on improving recall than on optimizing computation time.

From a technical point of view, the previous version of ServOMap was based on the following third party components: the JENA framework for processing ontologies and the Apache Lucene API as IR library. We have moved from the JENA framework to the OWLAPI library for ontology processing, in particular to handle complex domain and range axioms efficiently and to accept a wider range of input ontology formats. In addition, a more recent version of the IR library is used in the current version. However, in order to have a compatible SEALS client, we downgraded the version of the Apache Lucene API used for the evaluation. This led to a less robust system for the 2013 campaign, as some components had not been fully adapted.
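The best-candidate filtering mentioned in the post-processing step is not detailed in the paper; one plausible reading — greedy one-to-one selection by descending score — can be sketched as:

```python
def select_best(candidates):
    """Greedy 1:1 filtering: walk candidates by descending score and keep a
    couple only if neither side has been matched yet. This is an assumed
    reconstruction, not ServOMap's actual algorithm."""
    used_src, used_tgt, selected = set(), set(), []
    for src, tgt, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if src not in used_src and tgt not in used_tgt:
            selected.append((src, tgt, score))
            used_src.add(src)
            used_tgt.add(tgt)
    return selected
```

A greedy pass like this yields a functional (1:1) alignment before the disjointness and LogMap-based repair checks are applied.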
1.4 Link to the system and parameters file

The wrapped SEALS client for the ServOMap version used for the OAEI 2013 edition is available at http://lesim.isped.u-bordeaux2.fr/ServOMap. The instructions for testing the tool are described in the tutorial dedicated to the SEALS client2.

1.5 Link to the set of provided alignments

The results obtained by ServOMap during OAEI 2013 are available at http://lesim.isped.ubordeaux2.fr/ServOMap/oaei2013.zip/.
Similar resources
ServOMBI at OAEI 2015
We describe in this paper the ServOMBI system and the results achieved during the 2015 edition of the Ontology Alignment Evaluation Initiative. ServOMBI reuses components from the ServOMap ontology matching system, which used to participate in the OAEI campaign, and implements new features. This is the first participation of ServOMBI in the OAEI challenge. 1 Presentation of the System ServOM...
ServOMap and ServOMap-lt results for OAEI 2012
We present the results obtained by the ontology matching tools ServOMap and ServOMap-lite within the 8th edition of the Ontology Alignment Evaluation Initiative (OAEI 2012) campaign. The mapping computation is based on Information Retrieval techniques, thanks to the use of a dynamic knowledge repository tool, ServO. This is the first participation of the two systems...
Towards Learning Based Strategy for Improving the Recall of the ServOMap Matching System
In order to solve interoperability issues among heterogeneous knowledge-based applications, it is important to find correspondences between their underlying ontologies. This is the aim of the ServOMap system, a generic approach for large scale ontology matching. However, although achieving good results on the official Ontology Alignment Evaluation Initiative dataset, ServOMap's performance rema...
Results of the Ontology Alignment Evaluation Initiative 2013
Ontology matching consists of finding correspondences between semantically related entities of two ontologies. OAEI campaigns aim at comparing ontology matching systems on precisely defined test cases. These test cases can use ontologies of different nature (from simple thesauri to expressive OWL ontologies) and use different modalities, e.g., blind evaluation, open evaluation and consensus. OA...
Automating OAEI Campaigns
This paper reports the first effort into integrating OAEI and SEALS evaluation campaigns. OAEI is an annual evaluation campaign for ontology matching systems. The 2010 campaign includes a new modality in coordination with the SEALS project. This project aims at providing standardized resources (software components and data sets) for automatically executing evaluations of typical semantic web to...
Publication date: 2013